import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
General data exploration¶
- Load data and covert to appropriate types
# load data
site_data = pd.read_csv('data/SiteData.csv').convert_dtypes()
display(site_data.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 13221 entries, 0 to 13220 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 TIMESTAMP 13221 non-null string 1 READING 13221 non-null string 2 VALUE 13221 non-null Float64 dtypes: Float64(1), string(2) memory usage: 322.9 KB
None
- change timestap to datetime
site_data['TIMESTAMP'] = pd.to_datetime(site_data['TIMESTAMP'])
site_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 13221 entries, 0 to 13220 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 TIMESTAMP 13221 non-null datetime64[ns] 1 READING 13221 non-null string 2 VALUE 13221 non-null Float64 dtypes: Float64(1), datetime64[ns](1), string(1) memory usage: 322.9 KB
site_data.head()
| TIMESTAMP | READING | VALUE | |
|---|---|---|---|
| 0 | 2020-08-24 00:00:00 | PX_GENERATOR_POWER | 0.0 |
| 1 | 2020-08-24 00:00:00 | PX_RECTIFIER_LOAD | 2.3867 |
| 2 | 2020-08-24 00:00:00 | PX_GENERATOR_1_FUEL_LEVEL | 400.43 |
| 3 | 2020-08-24 00:30:00 | PX_GENERATOR_POWER | 0.0 |
| 4 | 2020-08-24 00:30:00 | PX_RECTIFIER_LOAD | 2.4072 |
- there are empty spaces in column names, fix that
site_data.columns = site_data.columns.str.strip()
site_data['TIMESTAMP'].dt.month.unique()
array([ 8, 9, 10, 11], dtype=int32)
site_data['READING'].unique()
<StringArray> ['PX_GENERATOR_POWER', 'PX_RECTIFIER_LOAD', 'PX_GENERATOR_1_FUEL_LEVEL'] Length: 3, dtype: string
# slice the data frame & create a copy for augmentation
generator = site_data[site_data['READING'] == 'PX_GENERATOR_POWER']
generator_power = generator.copy()
rectifier= site_data[site_data['READING'] == 'PX_RECTIFIER_LOAD']
rectifier_load = rectifier.copy()
fuel = site_data[site_data['READING'] == 'PX_GENERATOR_1_FUEL_LEVEL']
fuel_level = fuel.copy()
# create box plots for each reading
fig = px.box(y=site_data['VALUE'], x = site_data['READING'], color=site_data['READING'], log_y=True, title='Statistics of each reading', labels={'y': 'value', 'x': 'reading'})
fig.update_layout()
fig.update_traces(boxmean=True)
fig.show()
# Calculate average, min, and max
average_load = rectifier_load['VALUE'].mean()
min_load = rectifier_load['VALUE'].min()
max_load = rectifier_load['VALUE'].max()
print(f'average load: {average_load} kW, min load: {min_load} kW, max load: {max_load} kW')
# Calculate daily variation
rectifier_load['DATE'] = rectifier_load['TIMESTAMP'].dt.date
daily_variation = rectifier_load.groupby('DATE')['VALUE'].std()
average load: 1.8169906058543226 kW, min load: 0.4999 kW, max load: 2.9144 kW
# 1. Distribution of Rectifier Load with Average, Min, and Max
fig = px.histogram(rectifier_load, x='VALUE', nbins=50, title='Distribution of Rectifier Load with Average, Min, and Max')
fig.add_vline(x=average_load, line_dash="dash", line_color="red", annotation_text=f'Average: {average_load:.2f} kW')
fig.add_vline(x=min_load, line_dash="dash", line_color="green", annotation_text=f'Min: {min_load:.2f} kW')
fig.add_vline(x=max_load, line_dash="dash", line_color="blue", annotation_text=f'Max: {max_load:.2f} kW')
fig.update_layout(xaxis_title='Rectifier Load (kW)', yaxis_title='Frequency')
fig.show()
avg_daily_variation = daily_variation.mean()
print(f'average daily variation is {avg_daily_variation} kW')
average daily variation is 0.08713957714464081 kW
# 2. Average Daily Variation in Rectifier Load
daily_variation_data = rectifier_load.groupby('DATE')['VALUE'].std().reset_index()
fig = px.bar(daily_variation_data, x='DATE', y='VALUE', title='Daily Variation in Rectifier Load')
fig.add_hline(y=avg_daily_variation, line_dash="dash", line_color="red", annotation_text=f'Average Daily Variation: {avg_daily_variation:.2f} kW')
fig.update_layout(xaxis_title='Date', yaxis_title='Standard Deviation of Rectifier Load (kW)')
fig.show()
# plot the distribution of the daily deviation
fig = px.histogram(x=daily_variation, marginal='box', labels={'x':'standard deviation (kW)'}, title='Distribution of the daily vairation in rectifier load')
fig.update_layout()
fig.update_traces()
fig.show()
Note: although the mean variation is 0.08 kW, it is also good to have an idea of outliers which can help identify incidents with large load variation.
3. You will notice that the rectifier load has several step changes during the 3-month period. Does the generator usage on site change? How?¶
- Lets start by plotting the time series distribution for each reading with a common time stamp
- In additiona will also check the correlation of rectifier load and generator power
# Create subplots
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=['PX_GENERATOR_POWER', 'PX_RECTIFIER_LOAD'])
fig.add_trace(go.Scatter(x=generator_power['TIMESTAMP'], y=generator_power['VALUE'], mode='lines+markers', name='PX_GENERATOR_POWER'), row=1, col=1)
fig.add_trace(go.Scatter(x=rectifier_load['TIMESTAMP'], y=rectifier_load['VALUE'], mode='lines+markers', name='PX_RECTIFIER_LOAD'), row=2, col=1)
# fig.add_trace(go.Scatter(x=fuel_level['TIMESTAMP'], y=fuel_level['VALUE'], mode='lines+markers', name='PX_GENERATOR_1_FUEL_LEVEL'), row=3, col=1)
# Update layout
fig.update_layout(
title='Time Series for Each READING',
xaxis_title='Timestamp',
yaxis_title='Value',
height=600,
xaxis_rangeslider_visible=False
)
fig.show()
time-series plot: comparing the time series plots of rectifier load and generator power, it does look like there is a positive correlation betweed load and power, which make sense.- Lets investivage some scatter plots. Using the previous day to day rectifier load deviation plot, lets define a thershold of 0.02 as a step change in the rectifier load
# Calculate step changes
rectifier_load['STEP_CHANGE'] = rectifier_load['VALUE'].diff().abs()
# Define a threshold for significant step changes
threshold = 0.02
# Identify significant step changes in rectifier load
significant_rect_changes = rectifier_load[rectifier_load['STEP_CHANGE'] > threshold]
generator_power['VALUE'] = generator_power['VALUE'].astype(float)
# Merge with generator power to find corresponding values
merged_data = pd.merge_asof(significant_rect_changes.sort_values('TIMESTAMP'),
generator_power.sort_values('TIMESTAMP'),
on='TIMESTAMP', suffixes=('_rectifier', '_generator'))
# print(merged_data.columns)
# Plot the distributions where rectifier load change is above 0.02 kW
fig = go.Figure()
# Add rectifier load step changes
fig.add_trace(go.Scatter(x=merged_data['TIMESTAMP'], y=merged_data['VALUE_rectifier'], mode='markers', name='Significant Rectifier Changes', marker=dict(color='red'), yaxis='y2'))
# Add corresponding generator power
fig.add_trace(go.Scatter(x=merged_data['TIMESTAMP'], y=merged_data['VALUE_generator'], mode='markers', name='Corresponding Generator Power', marker=dict(color='blue')))
# Update layout for dual y-axis
fig.update_layout(
title='Significant Rectifier Load Step Changes and Corresponding Generator Power',
xaxis_title='Timestamp',
yaxis=dict(
title='Generator Power',
titlefont=dict(color='blue'),
tickfont=dict(color='blue')
),
yaxis2=dict(
title='Rectifier Load',
titlefont=dict(color='red'),
tickfont=dict(color='red'),
anchor='free',
overlaying='y',
side='right',
position=1
),
legend=dict(x=0, y=1.1, orientation='h')
)
fig.show()
NOTE: there are a lot of cases where power is 0 even though load is increasing. These cases can be considered as when the generator is not in use. However even when the generator is in use, the power consumption seems to fluctuate a lot. Is there a load balancer(?) that only draws power from generator when needed? So we are not seeing a pefect correclation.
Lets calculate a correlation coefficient ignoring the cases where power is 0
# correlation ignoring 0 power in generator
merged_data_ON = merged_data[merged_data['VALUE_generator'] != 0]
# Calculate correlation between rectifier step changes and generator power
correlation_ON = merged_data_ON[['STEP_CHANGE', 'VALUE_generator']].corr().iloc[0, 1]
fig = px.scatter(merged_data_ON, x='STEP_CHANGE', y='VALUE_generator',
title=f'Correlation between Rectifier Load Step Changes and Generator Power (ignoring OFF state) \nCorrelation Coefficient: {correlation_ON:.2f}',
labels={'STEP_CHANGE': 'Rectifier Load Step Change (kW)', 'VALUE_generator': 'Generator Power (kW)'},
trendline="ols")
fig.show()
We see a slightly positive correlation, which will make sense to me.
fig = go.Figure()
fig.add_trace(go.Scatter(x=fuel_level['TIMESTAMP'], y=fuel_level['VALUE'], mode='lines+markers'))
fig.show()
Identify refueling events - detect increases in fuel level¶
- Assumpition: the sensors can detect any changes more than 500 ml reliably
- Use a threshold of 0.5
fuel_level['FUEL_DIFF'] = fuel_level['VALUE'].diff()
refueling_events = fuel_level[fuel_level['FUEL_DIFF'] > 0.5] # any positive change is refueling
- Lets plot the refueling events
fig = px.line(fuel_level, x='TIMESTAMP', y='VALUE', title='Fuel Level with Refueling Events', line_shape='linear')
fig.add_trace(go.Scatter(x=refueling_events['TIMESTAMP'], y=refueling_events['VALUE'], mode='markers', name='Refueling Events', marker=dict(color='red')))
fig.show()
# Calculate the size of each refueling event
refueling_sizes = refueling_events[['TIMESTAMP', 'FUEL_DIFF']].reset_index(drop=True)
print(f'Total number of refueling = {refueling_sizes.shape[0]}')
display(refueling_sizes)
Total number of refueling = 35
| TIMESTAMP | FUEL_DIFF | |
|---|---|---|
| 0 | 2020-08-25 12:00:00 | 4.8 |
| 1 | 2020-08-25 13:30:00 | 2.73 |
| 2 | 2020-08-26 14:00:00 | 5.48 |
| 3 | 2020-08-29 11:00:00 | 3.42 |
| 4 | 2020-08-29 18:30:00 | 2.74 |
| 5 | 2020-08-31 20:30:00 | 5.48 |
| 6 | 2020-09-06 12:30:00 | 5.48 |
| 7 | 2020-09-08 09:30:00 | 0.69 |
| 8 | 2020-09-08 12:30:00 | 3.42 |
| 9 | 2020-09-11 11:30:00 | 479.04 |
| 10 | 2020-09-12 16:30:00 | 5.48 |
| 11 | 2020-09-15 14:00:00 | 6.16 |
| 12 | 2020-09-16 09:00:00 | 2.74 |
| 13 | 2020-09-16 12:30:00 | 2.74 |
| 14 | 2020-09-17 09:30:00 | 2.74 |
| 15 | 2020-09-19 15:30:00 | 5.48 |
| 16 | 2020-09-21 10:30:00 | 2.73 |
| 17 | 2020-09-23 15:00:00 | 105.39 |
| 18 | 2020-09-24 13:30:00 | 5.48 |
| 19 | 2020-10-06 08:30:00 | 2.74 |
| 20 | 2020-10-08 10:30:00 | 2.73 |
| 21 | 2020-10-08 11:30:00 | 2.74 |
| 22 | 2020-10-11 12:30:00 | 4.11 |
| 23 | 2020-10-12 11:30:00 | 4.11 |
| 24 | 2020-10-23 17:00:00 | 257.31 |
| 25 | 2020-10-25 16:00:00 | 8.9 |
| 26 | 2020-11-01 15:00:00 | 2.73 |
| 27 | 2020-11-07 17:30:00 | 166.98 |
| 28 | 2020-11-13 12:30:00 | 4.79 |
| 29 | 2020-11-17 12:30:00 | 5.47 |
| 30 | 2020-11-18 13:00:00 | 2.06 |
| 31 | 2020-11-19 15:00:00 | 5.47 |
| 32 | 2020-11-22 14:00:00 | 2.74 |
| 33 | 2020-11-23 12:00:00 | 2.73 |
| 34 | 2020-11-23 13:00:00 | 2.74 |
# Predict next refuel time (when fuel drops to 50L)
current_fuel_level = fuel_level['VALUE'].iloc[-1]
# print(current_fuel_level)
fuel_depletion_rate = fuel_level['VALUE'].diff().mean() # average rate of fuel consumption per 30 minutes
# print(fuel_depletion_rate)
time_to_next_refuel = (current_fuel_level - 50) / fuel_depletion_rate #mins
# print(time_to_next_refuel)
next_refuel_time = fuel_level['TIMESTAMP'].iloc[-1] + pd.Timedelta(minutes=time_to_next_refuel) # current time + delta time
print(f'Next refuel time = {next_refuel_time}')
Next refuel time = 2021-01-21 23:18:13.896713612
6. The generator and rectifier both display step changes over the time-period. How might you automatically detect and quantify these changes?¶
Similarly to 3, we have to define a threshold. For rectifier load we used the minimum of the daily variation as a threshold. We can do that same for generator_power
generator_power['DATE'] = generator_power['TIMESTAMP'].dt.date
power_daily_variation = generator_power.groupby('DATE')['VALUE'].std()
fig = px.histogram(x=power_daily_variation, marginal='box', labels={'x':'standard deviation (kW)'}, title='Distribution of the daily vairation in generator power')
fig.update_layout()
fig.update_traces()
fig.show()
# Re-calculate step changes in rectifier load
rectifier_load['RECTIFIER_LOAD_DIFF'] = rectifier_load['VALUE'].diff().abs()
# Step change detection for generator power
generator_power['GENERATOR_DIFF'] = generator_power['VALUE'].diff().abs()
significant_gen_changes = generator_power[generator_power['GENERATOR_DIFF'] > 1.5] # using 1.5 as a threshold based on above distribution
# Step change detection for rectifier load
significant_rect_changes = rectifier_load[rectifier_load['RECTIFIER_LOAD_DIFF'] > 0.02 ] # using 0.02 from #3
# Results
display(significant_gen_changes[['TIMESTAMP', 'VALUE', 'GENERATOR_DIFF']])
display(significant_rect_changes[['TIMESTAMP', 'VALUE', 'RECTIFIER_LOAD_DIFF']])
| TIMESTAMP | VALUE | GENERATOR_DIFF | |
|---|---|---|---|
| 9 | 2020-08-24 01:30:00 | 3.4697 | 3.4697 |
| 12 | 2020-08-24 02:00:00 | 10.4526 | 6.9829 |
| 15 | 2020-08-24 02:30:00 | 2.0011 | 8.4515 |
| 18 | 2020-08-24 03:00:00 | 0.0000 | 2.0011 |
| 33 | 2020-08-24 05:30:00 | 10.2884 | 10.1695 |
| ... | ... | ... | ... |
| 13107 | 2020-11-23 05:00:00 | 2.4821 | 6.0855 |
| 13110 | 2020-11-23 05:30:00 | 0.0000 | 2.4821 |
| 13194 | 2020-11-23 19:30:00 | 1.6934 | 1.6934 |
| 13197 | 2020-11-23 20:00:00 | 8.5276 | 6.8342 |
| 13203 | 2020-11-23 21:00:00 | 0.0000 | 8.5088 |
802 rows × 3 columns
| TIMESTAMP | VALUE | RECTIFIER_LOAD_DIFF | |
|---|---|---|---|
| 4 | 2020-08-24 00:30:00 | 2.4072 | 0.0205 |
| 10 | 2020-08-24 01:30:00 | 2.3504 | 0.0561 |
| 13 | 2020-08-24 02:00:00 | 2.3007 | 0.0497 |
| 16 | 2020-08-24 02:30:00 | 2.3592 | 0.0585 |
| 25 | 2020-08-24 04:00:00 | 2.3949 | 0.0333 |
| ... | ... | ... | ... |
| 13186 | 2020-11-23 18:00:00 | 1.7483 | 0.0389 |
| 13189 | 2020-11-23 18:30:00 | 1.7196 | 0.0287 |
| 13192 | 2020-11-23 19:00:00 | 1.5908 | 0.1288 |
| 13195 | 2020-11-23 19:30:00 | 1.7133 | 0.1225 |
| 13198 | 2020-11-23 20:00:00 | 1.6861 | 0.0272 |
2223 rows × 3 columns
Note, the threshold for each case is a choice based on the current data. We can change this threshold based on other requirements.
7. Solar power is an increasingly common power source utilised by tower companies to reduce costs (fuel) and carbon emissions. One significant cause of panel output being lower than anticipated is shading from surrounding trees, buildings or the tower itself. The image below shows how shading can impact a solar power profile – a solar power profile is generally a parabola shape, a shading profile can take many different forms, but the pattern must repeat over multiple days. How might you automatically detect and quantify this problem?¶
We can implement the following steps to quantify and detect shading
- Collect data over a week: We can analyze the daily power profiles to visualize how the daily solar power output is over the hours of the day. We can even use simulated data by generating it from a parabolic equation that reflects day time hours.
- Identify deviation from the parabolic shape (e.g. the plot above) i.e. the difference is power from step 1 over a week to make sure the pattern is repeated.
- Define a threshold of decrease in power.
- Deviation beyond this threshold will identify shading effect automatically.